Voice generation: Prosody transfer

Demo

By Yolanda Gao

Voice generation is the task of creating voices that best conform to supporting evidence, such as the structure of a skull or a 3D image/scan of a face. Often, the evidence does not carry the information needed to determine every aspect of the voice signal, including its language, content, and style of delivery. This is especially true when the evidence is merely the image of a face or the structure of a skull. The process therefore poses many challenges, each of which requires deeper consideration and a different set of approaches to address.

One such challenge is rendering the correct intonation, or prosody, onto the generated (or synthesized) voice signal. Of the many solutions we have explored, a promising one is to take a “style of rendering”, or prosody, from an exemplar voice sample, learned from a database of renderings by humans, and “transfer” it to the generated voice signal. Note that in this case the goal is simply to emulate the prosody -- the content of the generated signal may differ from that of the exemplar.
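As a minimal sketch of what “transferring” a prosody can mean (this is an illustrative toy, not our actual mechanism), one simple form of transfer resamples an exemplar pitch (F0) contour to the length of the target utterance and rescales it to the target speaker's pitch statistics; all names below are hypothetical:

```python
import numpy as np

def transfer_f0_contour(ref_f0, target_len, target_mean, target_std):
    """Resample a reference F0 contour to the target utterance's length,
    then rescale it to the target speaker's pitch statistics."""
    # Normalize the reference contour to zero mean, unit variance.
    norm = (ref_f0 - ref_f0.mean()) / ref_f0.std()
    # Linearly resample to the target utterance's frame count.
    x_ref = np.linspace(0.0, 1.0, len(ref_f0))
    x_tgt = np.linspace(0.0, 1.0, target_len)
    resampled = np.interp(x_tgt, x_ref, norm)
    # Rescale to the target speaker's pitch range.
    return resampled * target_std + target_mean

# Toy example: a rising exemplar contour imposed on a longer utterance
# by a higher-pitched speaker.
ref = np.linspace(120.0, 180.0, 50)   # Hz, 50 frames
out = transfer_f0_contour(ref, target_len=80, target_mean=200.0, target_std=20.0)
```

The rising shape of the exemplar is preserved, while the absolute pitch level now matches the target speaker.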

In the examples below, we show the results of one such mechanism that we have devised to transfer prosody from learned exemplars to the generated signal. The first set, labeled “references”, shows sets of exemplars derived from three different databases. We then attempt to "lift" the prosody from these exemplars.

The two sets of examples that follow demonstrate the results of this process on signals that have the same linguistic content as the exemplar, and on signals with different content (which is our final goal).

1. Derivation of prosody from different datasets

Different prosodies are automatically factorized from the training dataset. Each of these factorized "prosodies" can then be transferred to a test set.
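As a minimal sketch of such factorization (assuming simple per-utterance prosodic features and plain k-means clustering, not our actual method), discrete "prosody" styles can be obtained by clustering utterance-level statistics such as mean F0 and F0 range:

```python
import numpy as np

def factorize_prosodies(features, n_styles, n_iters=20):
    """Cluster per-utterance prosodic features (here, mean F0 and F0
    range) into a small set of discrete 'styles' with plain k-means."""
    # Initialize centroids from evenly spaced utterances (a toy choice).
    idx = np.linspace(0, len(features) - 1, n_styles).astype(int)
    centers = features[idx].copy()
    for _ in range(n_iters):
        # Assign each utterance to its nearest style centroid.
        dists = np.linalg.norm(features[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Re-estimate each centroid from its members.
        for k in range(n_styles):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    return centers, labels

# Toy data: two well-separated prosodic regimes.
rng = np.random.default_rng(1)
low = rng.normal([110.0, 20.0], 2.0, size=(30, 2))   # low pitch, narrow range
high = rng.normal([220.0, 60.0], 2.0, size=(30, 2))  # high pitch, wide range
centers, labels = factorize_prosodies(np.vstack([low, high]), n_styles=2)
```

Each cluster centroid then stands in for one discrete "style" that can be applied at test time.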

1.1 Different prosodies derived from VCTK dataset

Utterance: "I’ve felt the chance that I have a number of options."

In [6]:
    import sys
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile
    import IPython.display as ipd

    # Token examples
    dirn = './Tokens/'
    # First dataset: VCTK
    path = dirn + 'VCTK/'

    for indx, i in enumerate([0, 1, 3, 4]):
        print('Style' + str(indx + 1) + ': ')
        src_path = path + 'token' + str(i + 1) + '_16bitPCM.wav'
        fs, src_waveform = wavfile.read(src_path)
        ipd.display(ipd.Audio(src_waveform, rate=fs))
        
    
Style1: 
Style2: 
Style3: 
Style4: 

1.2 Different prosodies derived from Beijing dataset

Utterance: "Just recovered a fumble on ensuing kickoff."

In [2]:
    import sys
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile
    import IPython.display as ipd

    # Token examples
    dirn = './Tokens/'
    # Second dataset: Beijing
    path = dirn + 'otherDataset/'

    for i in range(4):
        print('Style' + str(i + 1) + ': ')
        src_path = path + 'Token' + str(i + 1) + '_16bitPCM.wav'
        fs, src_waveform = wavfile.read(src_path)
        ipd.display(ipd.Audio(src_waveform, rate=fs))
        
    
Style1: 
Style2: 
Style3: 
Style4: 

1.3 Different prosodies derived from Blizzard 2013 dataset

Utterance: "Just recovered a fumble on ensuing kickoff."

In [57]:
    import sys
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile
    import IPython.display as ipd

    # Token examples
    dirn = './Tokens/'
    # Third dataset: Blizzard 2013
    path = dirn + 'Blizzards/'

    for i in range(5):
        print('Style' + str(i + 1) + ': ')
        src_path = path + 'token' + str(i + 1) + '_16bitPCM.wav'
        fs, src_waveform = wavfile.read(src_path)
        ipd.display(ipd.Audio(src_waveform, rate=fs))
        
Style1: 
Style2: 
Style3: 
Style4: 
Style5: 

2. Prosody Transfer

In this section, we show the results of prosody transfer from a reference utterance to a generated utterance.

2.1 Same content

The linguistic content of the reference and generated utterances is the same.

Example 1

Utterance content: My mother always took him to the town on a market day in a light gig.

In [8]:
    import sys
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile
    import IPython.display as ipd

    # Prosody transfer examples, parallel (same-content) case
    dirn = './ProsodyTransfer/'
    path = dirn + 'parallel/example1/'

    name = ['ref', 'transfer1_neutral', 'prosodyT_16bitPCM']
    content = ['Reference utterance:', 'No (neutral) prosody transfer:', 'With prosody transfer:']
    for i in range(3):
        print(content[i])
        src_path = path + name[i] + '.wav'
        fs, src_waveform = wavfile.read(src_path)
        ipd.display(ipd.Audio(src_waveform, rate=fs))
    
    
Reference utterance:
No (neutral) prosody transfer:
With prosody transfer:

Example 2

Utterance content: So we never saw Dick any more.

In [9]:
    import sys
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile
    import IPython.display as ipd

    # Prosody transfer examples, parallel (same-content) case
    dirn = './ProsodyTransfer/'
    path = dirn + 'parallel/example2/'

    name = ['ref', 'transfer1_neutral', 'prosodyT_16bitPCM']
    content = ['Reference utterance:', 'No prosody transfer:', 'With prosody transfer:']
    for i in range(3):
        print(content[i])
        src_path = path + name[i] + '.wav'
        fs, src_waveform = wavfile.read(src_path)
        ipd.display(ipd.Audio(src_waveform, rate=fs))
Reference utterance:
No prosody transfer:
With prosody transfer:

Example 3

Utterance content: You will be to visit me in prison with a basket of provisions, you will not refuse to visit me in prison?

In [4]:
    import sys 
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile 
    import IPython
    import IPython.display as ipd
    import numpy as np
    
        
    # Token examples
    dirn = './ProsodyTransfer/'
    # Firstdataset
    path = dirn + 'parallel/example3/'

    name = ['ref','neutral_16bitPCM','prosodyT_16bitPCM']
    content = ['Refence utterance:','No prosody transfer:','With prosody transfer:']
    for i in range(3):
        print(content[i])
        Bsrc_path = path + name[i] + '.wav'       
        fs, src_waveform = wavfile.read(Bsrc_path)
        IPython.display.display(ipd.Audio(src_waveform, rate=fs))
Reference utterance:
No prosody transfer:
With prosody transfer:

2.2 Different content

In the following, the linguistic content of the reference utterance is different from that of the generated utterance.

Example 1

In [5]:
    import sys
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile
    import IPython.display as ipd

    # Prosody transfer examples, non-parallel (different-content) case
    dirn = './ProsodyTransfer/'
    path = dirn + 'unparallel/example1/'

    name = ['ref', 'transfer_16bitPCM', 'transfer1_neutral', 'transfer2_16bitPCM', 'transfer2_neutral']
    content = ['Reference utterance: ', 'With prosody transfer: ', 'No prosody transfer: ',
               'With prosody transfer: ', 'No prosody transfer: ']
    textc = ['My mother always took him to the town on a market day in a light gig.',
             'So we never saw Dick any more.',
             'Just recovered a fumble on ensuing kickoff.']
    c = 0
    for i in range(5):
        print(content[i])
        src_path = path + name[i] + '.wav'
        fs, src_waveform = wavfile.read(src_path)
        ipd.display(ipd.Audio(src_waveform, rate=fs))
        # The reference and the two transferred utterances carry new content.
        if i in [0, 1, 3]:
            print('Content: ' + textc[c])
            c += 1
        print('\n')
Reference utterance: 
Content: My mother always took him to the town on a market day in a light gig.


With prosody transfer: 
Content: So we never saw Dick any more.


No prosody transfer: 

With prosody transfer: 
Content: Just recovered a fumble on ensuing kickoff.


No prosody transfer: 

Example 2

In [6]:
    import sys
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile
    import IPython.display as ipd

    # Prosody transfer examples, non-parallel (different-content) case
    dirn = './ProsodyTransfer/'
    path = dirn + 'unparallel/example2/'

    name = ['ref', 'transfer_16bitPCM', 'transfer1_neutral', 'transfer2_16bitPCM', 'transfer2_neutral']
    content = ['Reference utterance: ', 'With prosody transfer: ', 'No prosody transfer: ',
               'With prosody transfer: ', 'No prosody transfer: ']
    textc = ['You will be to visit me in prison with a basket of provisions, you will not refuse to visit me in prison?',
             'My mother always took him to the town on a market day in a light gig.',
             "There was nothing disagreeable in Mister Rushworth's appearance."]
    c = 0
    for i in range(5):
        print(content[i])
        src_path = path + name[i] + '.wav'
        fs, src_waveform = wavfile.read(src_path)
        ipd.display(ipd.Audio(src_waveform, rate=fs))
        # The reference and the two transferred utterances carry new content.
        if i in [0, 1, 3]:
            print('Content: ' + textc[c])
            c += 1
        print('\n')
Reference utterance: 
Content: You will be to visit me in prison with a basket of provisions, you will not refuse to visit me in prison?


With prosody transfer: 
Content: My mother always took him to the town on a market day in a light gig.


No prosody transfer: 

With prosody transfer: 
Content: There was nothing disagreeable in Mister Rushworth's appearance.


No prosody transfer: 

Example 3

In [7]:
    import sys
    sys.path.append('/usr/local/lib/python2.7/site-packages')
    from scipy.io import wavfile
    import IPython.display as ipd

    # Prosody transfer examples, non-parallel (different-content) case
    dirn = './ProsodyTransfer/'
    path = dirn + 'unparallel/example3/'

    name = ['ref', 'transfer_16bitPCM', 'transfer1_neutral', 'transfer2_16bitPCM', 'transfer2_neutral']
    content = ['Reference utterance: ', 'With prosody transfer: ', 'No prosody transfer: ',
               'With prosody transfer: ', 'No prosody transfer: ']
    textc = ["There was nothing disagreeable in Mister Rushworth's appearance, and Sir Thomas was liking him already.",
             'Just recovered a fumble on ensuing kickoff.',
             'My mother always took him to the town on a market day in a light gig.']
    c = 0
    for i in range(5):
        print(content[i])
        src_path = path + name[i] + '.wav'
        fs, src_waveform = wavfile.read(src_path)
        ipd.display(ipd.Audio(src_waveform, rate=fs))
        # The reference and the two transferred utterances carry new content.
        if i in [0, 1, 3]:
            print('Content: ' + textc[c])
            c += 1
        print('\n')
Reference utterance: 
Content: There was nothing disagreeable in Mister Rushworth's appearance, and Sir Thomas was liking him already.


With prosody transfer: 
Content: Just recovered a fumble on ensuing kickoff.


No prosody transfer: 

With prosody transfer: 
Content: My mother always took him to the town on a market day in a light gig.


No prosody transfer: 
